2019 iT 邦幫忙 Ironman Contest

DAY 16

AI & Data

30 Days of Python Learning Notes series, part 16

Day 16 - Modules for Web Scraping: Beautiful Soup 1


wayneli

2018-10-30 18:54:20


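In the previous article we fetched the web page data. Today's topic is how to parse what we retrieved, and the main module for that is BeautifulSoup. If you use Anaconda, you first need to add this package to your environment (see Day 3 for how to add packages). Once it is installed, let's learn how to use it!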

Beautiful Soup

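Beautiful Soup is a Python library that lets developers parse a page's HTML with very little code, extract the data they care about, and discard the rest. This lowers the barrier to writing a web crawler and speeds up development.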

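Importing the module and adding an HTML sample

After installing through Anaconda, the package lives in the bs4 folder inside Anaconda's package directory, so the import needs a from clause, as in this example: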

from bs4 import BeautifulSoup
html = """
<html >
<head>
    <meta charset="UTF-8">
    <title>this is a title</title>
</head>
<body>
<p class="news">123</p>
<p class="contents" id="i1">456</p>
<a href="http://www.baidu.com">advertisements</a>
</body>
</html>
"""
# First import the module and define some HTML content

Basic usage: parsing

soup = BeautifulSoup(html, 'html.parser')

Getting HTML tag data
With the parsed object, we can access a tag by name to get the content it wraps

print(soup.head)

# Output:
<head>
<meta charset="utf-8"/>
<title>this is a title</title>
</head>

print(soup.title)

# Output:
<title>this is a title</title>
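One detail worth knowing here: accessing a tag by name as an attribute returns only the first matching tag, and a tag that does not exist gives None. A minimal sketch, using a shortened copy of the sample HTML (the variable names first_p and missing are only for illustration):

```python
from bs4 import BeautifulSoup

html = '<body><p class="news">123</p><p class="contents" id="i1">456</p></body>'
soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns the first <p> only
first_p = soup.p
print(first_p)  # <p class="news">123</p>

# A tag that is not in the document yields None instead of raising
missing = soup.h1
print(missing)  # None
```

Because of this, it is safer to check for None before chaining further lookups on a tag that might be absent.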

To get the content without the tags, append .text or .string to get the text

print(soup.title.string)

# Output:
this is a title
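The two attributes behave the same for a tag that holds a single piece of text, but they differ once a tag has nested children. A small sketch, assuming a made-up HTML snippet with a nested b tag (not part of the sample above):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: <p> contains both text and a nested <b> tag
soup = BeautifulSoup(
    "<title>this is a title</title><p>Hello <b>world</b></p>", "html.parser"
)

# A tag with a single text child: both give the same text
title_string = soup.title.string
title_text = soup.title.text
print(title_string)  # this is a title
print(title_text)    # this is a title

# A tag with several children: .string is None, .text joins all the text
p_string = soup.p.string
p_text = soup.p.text
print(p_string)  # None
print(p_text)    # Hello world
```

When the structure of the tag is not known in advance, .text is usually the more forgiving choice.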

Formatting HTML: prettify()

print(soup.prettify())
# Output:
<html>
 <head>
  <meta charset="utf-8"/>
  <title>
   this is a title
  </title>
 </head>
 <body>
  <p class="news">
   123
  </p>
  <p class="contents" id="i1">
   456
  </p>
  <a href="http://www.baidu.com">
   advertisements
  </a>
 </body>
</html>
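prettify() returns an ordinary string, so it can be compared with the compact form from str(). The sketch below uses a single p tag to keep the output short; the names compact, pretty and pieces are illustrative:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="news">123</p>', "html.parser")

compact = str(soup)        # the markup on a single line
pretty = soup.prettify()   # a str with each tag and text node on its own line

print(compact)
print(pretty)

# Strip the indentation to see which pieces ended up on separate lines
pieces = [line.strip() for line in pretty.splitlines() if line.strip()]
print(pieces)  # ['<p class="news">', '123', '</p>']
```

Note that prettify() adds whitespace around text, so it is meant for reading the structure rather than for extracting exact text values.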

Extracting tag nodes from the HTML: find_all()
find_all() finds every occurrence of the specified tag in the object

p = soup.find_all('p')
print(p)

# Output:
[<p class="news">123</p>, <p class="contents" id="i1">456</p>]

The result behaves like a list, so you can use [] to pick the tag at a given position:

print(p[0])
# Output: <p class="news">123</p>
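Since the result behaves like a list, the usual list operations such as len(), negative indexing and iteration should also work. A minimal sketch with the two p tags from the sample (the variable names are only for illustration):

```python
from bs4 import BeautifulSoup

html = '<p class="news">123</p><p class="contents" id="i1">456</p>'
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")

# Number of matches
count = len(paragraphs)
print(count)  # 2

# Loop over the matches and collect the text of each one
texts = [p.text for p in paragraphs]
print(texts)  # ['123', '456']

# Negative indexes work as with any list
last = paragraphs[-1]
print(last)  # <p class="contents" id="i1">456</p>
```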

Getting several kinds of tags:

tag = soup.find_all(["a", "p"])
print(tag)
# Output:
[<p class="news">123</p>, <p class="contents" id="i1">456</p>, <a href="http://www.baidu.com">advertisements</a>]
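The matches come back in the order they appear in the document, not in the order of the names in the list. A short sketch to illustrate, assuming a rearranged snippet in which the a tag comes first:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: here <a> appears before <p> in the document
html = '<a href="http://www.baidu.com">advertisements</a><p class="news">123</p>'
soup = BeautifulSoup(html, "html.parser")

# "p" is listed first, but the result follows document order
tags = soup.find_all(["p", "a"])
names = [tag.name for tag in tags]
print(names)  # ['a', 'p']
```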

By default find_all() walks through every matching tag; you can also cap how many it returns:

p = soup.find_all('p', limit=1)
print(p)
# Output:
[<p class="news">123</p>]
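When only the first match is needed, Beautiful Soup also provides find(), which returns the tag itself rather than a one-element list. A minimal sketch comparing the two (the variable names are only for illustration):

```python
from bs4 import BeautifulSoup

html = '<p class="news">123</p><p class="contents" id="i1">456</p>'
soup = BeautifulSoup(html, "html.parser")

# find_all with limit=1 still returns a list
limited = soup.find_all("p", limit=1)
print(limited)  # [<p class="news">123</p>]

# find returns the first matching tag directly
first = soup.find("p")
print(first)  # <p class="news">123</p>

# With no match, find gives None while find_all gives an empty list
nothing = soup.find("table")
empty = soup.find_all("table")
print(nothing, len(empty))  # None 0
```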

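The next article covers ways to extract values using more HTML attributes. To be continued!
If anything in this article is wrong, please leave a comment to let me know and help correct any misconceptions. Thanks!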
